Skip to content

Reduce GOP decoder stream synchronizations#33

Merged
xupinjie merged 1 commit into
NVIDIA:mainfrom
KazukiKomon:zhy/reduce-gop-decoder-sync
Jun 12, 2026
Merged

Reduce GOP decoder stream synchronizations#33
xupinjie merged 1 commit into
NVIDIA:mainfrom
KazukiKomon:zhy/reduce-gop-decoder-sync

Conversation

@KazukiKomon

Copy link
Copy Markdown
Contributor

Summary

  • Enable async/event-backed NvDecoder frame copies for GOP decoding.
  • Move raw YUV output copy onto the decoder stream.
  • Synchronize once at DecProc completion instead of synchronizing inside each displayed frame copy.

Verification

  • git diff --check

Discussion

This draft mirrors the internal MR diff for discussion. One point to review is that enabling bEnableAsyncAllocations activates the existing async allocation path in NvDecoder.

@xupinjie

Copy link
Copy Markdown
Collaborator

/build

Comment on lines 640 to 644

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this param ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter enables NvDecoder async allocation/event mode. Without it, NvDecoder::HandlePictureDisplay synchronizes the decoder stream for each displayed frame; enabling it lets this path queue the displayed-frame copies and synchronize the decoder stream after DecProc has queued the output copies. I added a code comment at the constructor site to make that intent explicit.

// Copy the decode frames from device
CUDA_DRVAPI_CALL(cuMemcpyDtoD((CUdeviceptr)pFrame_buffer, (CUdeviceptr)pFrame, decoder->GetFrameSize()));
// Keep the decode-buffer copy on the decoder stream. DecProc performs one
// terminal stream sync before returning the Python-visible frame.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"one terminal stream sync before returning the Python-visible frame" may be a misleading. There are some other sync when decoder cuda api call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I changed the comment to avoid implying this is the only CUDA sync in the decoder path. It now only states that the raw output copy is queued on the decoder stream and that DecProc synchronizes that stream before returning the Python-visible frame.

@KazukiKomon KazukiKomon marked this pull request as ready for review June 11, 2026 08:09
@KazukiKomon KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch 2 times, most recently from 0acdde1 to 431e590 Compare June 11, 2026 08:12
@KazukiKomon

Copy link
Copy Markdown
Contributor Author

/build

1 similar comment
@xupinjie

Copy link
Copy Markdown
Collaborator

/build

@xupinjie

Copy link
Copy Markdown
Collaborator

PR #33bEnableAsyncAllocations=true: correct, but ~16% slower

Tested on H200 NVL, CUDA 12.9, against PR head g431e59076. Same tree, only the flag flipped.

Verdict: revert the flag to false. It's safe but gives no speed-up — it's a measurable
regression. The rest of the PR (cuMemcpyDtoDAsync + single terminal sync in DecProc) is fine.

Performance — ~16% regression

Per-Decode() for a 36-frame batch, 80 iters × 3 rounds:

Build min (ms) median (ms)
async=false (per-frame sync) 40.8 43.5
async=true (PR) 45.4 50.7

Distributions don't overlap (true min > false median).

Why (nsys, CUDA API call counts)

API bEnableAsyncAllocations =true bEnableAsyncAllocations =false
cuStreamSynchronize 1.4k 64k
cuEventRecord 126k 63k
cuMemcpy2DAsync / cuLaunchKernel identical identical
cuMemAllocAsync warm-up only

The flag just swaps ~63k per-frame cuStreamSynchronize for ~63k per-frame cuEventRecord — no
GPU work removed. The allocation change (its namesake) is irrelevant: buffers are pooled, so zero
allocs in steady state. Dropping the sync lets the host outrun a 2-surface pool, and the cost
reappears as blocking inside the enqueue calls.

@KazukiKomon KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch from 431e590 to 91d4ed2 Compare June 11, 2026 09:44
@KazukiKomon

Copy link
Copy Markdown
Contributor Author

/build

@KazukiKomon

Copy link
Copy Markdown
Contributor Author

Updated in 91d4ed2: reverted bEnableAsyncAllocations back to false based on the benchmark result. The PR now only keeps the raw-path cuMemcpyDtoDAsync change plus the terminal decoder-stream sync in DecProc.

Signed-off-by: hongyizhang <805701948@qq.com>
@KazukiKomon KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch from 91d4ed2 to 96dfdcc Compare June 11, 2026 10:07
@xupinjie

Copy link
Copy Markdown
Collaborator

Performance Analysis: cuMemcpyDtoDcuMemcpyDtoDAsync

Problem

GetYUVFromFrame previously used cuMemcpyDtoD, which targets the CUDA
legacy null stream. The null stream has implicit bidirectional synchronization
with any blocking CUDA stream: each copy must wait for all pending work on
blocking streams to drain, and vice versa. This causes per-frame decoder
latency to be gated by external CUDA workloads sharing the same context.

Additionally, the old DecProc had no stream sync after the last frame's
copy, meaning Decode() could return before the final frame's data was fully
written — a silent race condition.

The fix replaces cuMemcpyDtoD with cuMemcpyDtoDAsync(decoder->GetStream())
so copies are scoped to the decoder stream only, and adds a single
cuStreamSynchronize at the end of DecProc to guarantee all frames are
ready before returning.

Setup

  • GPU: NVIDIA H200 NVL
  • Workload: batch of 8 video files, 1 YUV frame each per Decode() call,
    40 iterations, median reported
  • Concurrent load: a background thread running continuous 2048² fp16
    matmuls, tested with two stream types

Results

Concurrent load Old (cuMemcpyDtoD) New (cuMemcpyDtoDAsync)
None 5.3 ms 5.5 ms
Non-blocking stream (torch.cuda.Stream()) 5.4 ms 5.5 ms
Blocking stream (cudaStreamCreate) 62.6 ms (+1083%) 5.4 ms (+0%)

Conclusion

  • Typical PyTorch training (streams are cudaStreamNonBlocking by
    default): no performance difference — the null stream global barrier does
    not apply to non-blocking streams.
  • Blocking stream workloads (legacy NCCL, cuBLAS global stream, custom
    CUDA extensions): old code inflates decoder latency ~10×; new code is
    unaffected.
  • Correctness: the new cuStreamSynchronize eliminates a race condition
    where the last frame's buffer could be read before its copy completed.

@xupinjie xupinjie merged commit edae948 into NVIDIA:main Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants